Report: YIEDL Experiment 1 (Old vs. New Dataset)

Author

Joe (Degenius Maximus) Chow

Published

April 29, 2025

Version: 0.2


1 Introduction

For the YIEDL competition we distribute a weekly dataset, with week-to-week targets, which has fewer features than the new daily dataset given to Numerai for their crypto competition. We should compare the performance of the two datasets under a variety of models and check whether it makes sense to push the adoption of the new dataset in our competitions.

This report summarises the findings from the first experiment. The ultimate goal is to identify the key performance differences between models trained on either the old (weekly) or the new (daily) YIEDL dataset. For this experiment, I decided to use a large grid search instead of a few fine-tuned models. Here are my reasons:

  1. Fine-tuned models (trained based on my previous experience with YIEDL and Numerai) may introduce survivorship bias.
  2. Training models from a large grid search will likely result in three groups of models: underfitted, about right, and overfitted.
  3. These three groups of models can roughly simulate a real-world situation where YIEDL would receive predictions from newbie, intermediate and expert users.
  4. If most models trained using daily data show improved out-of-bag predictive performance, this could strongly indicate that daily data is effective, thus addressing the primary research question.

1.1 Results Overview (Long Story Short)

If you’re swamped with deadlines and your coffee’s ice-cold, here’s the straight-to-the-point rundown on the results and conclusion to keep you informed without the fuss!

Conclusion: most users should be able to achieve better model performance with daily data compared to weekly data.

2 Experiment Set-up

2.1 Datasets

The following two datasets from https://yiedl.ai/competition/datasets were used:

  1. YIEDL Weekly Data - dataset_weekly_2025_15.zip
  2. YIEDL Daily Data - dataset_daily_2025_15.zip

2.2 Training vs. Test Periods

  1. Training: 2018-04-27 to 2022-10-31
  2. Embargo: 2022-11-01 to 2022-12-31 (a two-month gap between training and test to avoid data leakage)
  3. Test: 2023-01-01 to 2025-04-06
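The periods above amount to a simple date filter; here is an illustrative Python sketch (assuming the data sits in a pandas DataFrame with a date column, as in the examples below — the actual pipeline may differ):

```python
import pandas as pd

def split_train_test(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Split rows by date. The two-month embargo (2022-11-01 to 2022-12-31)
    is simply excluded from both sets to avoid leakage."""
    dates = pd.to_datetime(df["date"])
    train = df[(dates >= "2018-04-27") & (dates <= "2022-10-31")]
    test = df[(dates >= "2023-01-01") & (dates <= "2025-04-06")]
    return train, test
```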

2.3 Stats

  1. Training (Weekly) = 98090 samples from 2018-04-29 to 2022-10-30.
  2. Training (Daily) = 684853 samples from 2018-04-27 to 2022-10-31.
  3. Test (Weekly) = 144828 samples from 2023-01-01 to 2025-04-06.

2.4 Features and Targets

There are 1140 features and 2 targets (target_neutral and target_updown) in the datasets. Here is an example of the weekly training data:

             date symbol  feature_1  feature_2 target_neutral target_updown
           <Date> <char>      <num>      <num>          <num>         <num>
    1: 2018-04-29    BTC 0.02564103 0.00000000      0.7435897    0.02502580
    2: 2018-04-29    PRE 0.33333333 0.07692308      0.4871795   -0.02182056
    3: 2018-04-29    QSP 0.58974359 0.58974359      0.9487179    0.21941261
    4: 2018-04-29    ADA 0.84615385 0.71794872      0.4102564   -0.04118809
    5: 2018-04-29   TRAC 0.38461538 0.35897436      0.8205128    0.11419655
   ---                                                                     
98086: 2022-10-30    CVX 0.52785714 0.49785714      0.3200000   -0.01969928
98087: 2022-10-30    CVC 0.52785714 0.49785714      0.7500000    0.04850496
98088: 2022-10-30   DDIM 0.52785714 0.49785714      0.2142857   -0.04630442
98089: 2022-10-30   DIGG 0.52785714 0.49785714      0.6928571    0.03206955
98090: 2022-10-30   DAWN 0.52785714 0.90714286      0.6128571    0.02011033

2.5 Models

The models can be categorised into four groups:

  1. 1008 models trained with weekly data + target_neutral —> predict on weekly data
  2. 1008 models trained with daily data + target_neutral —> predict on weekly data
  3. 1008 models trained with weekly data + target_updown —> predict on weekly data
  4. 1008 models trained with daily data + target_updown —> predict on weekly data

Note: each model is a simple average ensemble from three runs with three different random seeds.
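As a minimal sketch of this set-up, the grid expansion and three-seed averaging could look like the following in Python. The grid values and the `train_and_predict` callable here are placeholders, not the actual 1008-combination grid used in the experiment:

```python
import itertools
import numpy as np

# Placeholder grid axes -- the report only states that the full grid
# contains 1008 parameter combinations, not the actual values.
GRID = {
    "max_depth": [3, 5],
    "subsample": [0.7, 0.9],
    "colsample_bytree": [0.7, 0.9],
    "round": [100, 300],
}

def grid_combinations(grid: dict) -> list:
    """Expand a parameter grid into one dict per combination."""
    keys = list(grid)
    return [dict(zip(keys, vals)) for vals in itertools.product(*grid.values())]

def seed_ensemble(train_and_predict, params: dict, seeds=(1, 2, 3)) -> np.ndarray:
    """Simple average of predictions from three runs with different seeds."""
    return np.mean([train_and_predict(params, seed) for seed in seeds], axis=0)
```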

3 Predictions

Here is an example of predictions from models trained with the neutral targets:

         date   symbol yhat_weekly yhat_daily
       <Date>   <char>       <num>      <num>
1: 2023-01-01     ATOM  0.46437995 0.43139842
2: 2023-01-01      LTC  0.20844327 0.21108179
3: 2023-01-01    MARSH  0.09498681 0.12664908
4: 2023-01-01     UNCX  0.51846966 0.51319261
5: 2023-01-01 UNISTAKE  0.30606860 0.32849604
6: 2023-01-01      TCT  0.01451187 0.01055409

Similarly, we can look at the predictions from models trained with the updown targets:

         date   symbol yhat_weekly  yhat_daily
       <Date>   <char>       <num>       <num>
1: 2023-01-01     ATOM  0.01370640  0.01613749
2: 2023-01-01      LTC  0.00382110  0.00447802
3: 2023-01-01    MARSH  0.00873456  0.00900494
4: 2023-01-01     UNCX  0.00514489  0.01064970
5: 2023-01-01 UNISTAKE  0.00944165  0.01341015
6: 2023-01-01      TCT  0.00662424 -0.13239994

4 Evaluation Metrics

4.1 Primary Metrics

4.1.1 Target Neutral - Spearman Correlation

For target_neutral, I calculated the date-wise Spearman correlation by comparing the predictions from weekly/daily models with the targets. Here is an example:

         date cor_weekly   cor_daily
       <Date>      <num>       <num>
1: 2023-01-01 0.08863752  0.08433144
2: 2023-01-08 0.02875134 -0.01425746
3: 2023-01-15 0.09336099  0.12751131
4: 2023-01-22 0.11488212  0.11423694
5: 2023-01-29 0.06747419  0.04651512
6: 2023-02-05 0.06433825  0.08407862

We can see that the mean correlation from the daily models is higher (better) than that of the weekly models in this example.
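The date-wise Spearman correlation can be computed as follows (a Python sketch; column names are assumed to match the prediction tables above):

```python
import pandas as pd

def datewise_spearman(df: pd.DataFrame, yhat: str, target: str) -> pd.Series:
    """Spearman correlation between predictions and targets for each date."""
    grouped = df.groupby("date")[[yhat, target]]
    return grouped.apply(lambda g: g[yhat].corr(g[target], method="spearman"))
```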

4.1.2 Target Updown - RMSE

Similarly, here is an example of the date-wise RMSE evaluation for target_updown:

         date rmse_weekly rmse_daily
       <Date>       <num>      <num>
1: 2023-01-01   0.1931192  0.1804634
2: 2023-01-08   0.3309541  0.2766155
3: 2023-01-15   0.2755370  0.1963707
4: 2023-01-22   0.2757088  0.2634905
5: 2023-01-29   0.4911367  0.4936971
6: 2023-02-05   0.5768729  0.5564189

Using the summary() function, we can see that there are some outliers in the data:

      date             rmse_weekly          rmse_daily       
 Min.   :2023-01-01   Min.   :     0.19   Min.   :     0.17  
 1st Qu.:2023-07-26   1st Qu.:     0.33   1st Qu.:     0.25  
 Median :2024-02-18   Median :     0.41   Median :     0.31  
 Mean   :2024-02-18   Mean   : 13072.28   Mean   : 13072.22  
 3rd Qu.:2024-09-11   3rd Qu.:     0.57   3rd Qu.:     0.52  
 Max.   :2025-04-06   Max.   :228482.57   Max.   :228482.54  

The outliers are removed (based on the IQR rule) in order to get a meaningful plot. Here is the summary of the cleaned data:

      date                weekly           daily       
 Min.   :2023-01-01   Min.   :0.1931   Min.   :0.1657  
 1st Qu.:2023-07-26   1st Qu.:0.3195   1st Qu.:0.2435  
 Median :2024-02-18   Median :0.3689   Median :0.2890  
 Mean   :2024-02-18   Mean   :0.4098   Mean   :0.3294  
 3rd Qu.:2024-09-11   3rd Qu.:0.4577   3rd Qu.:0.3629  
 Max.   :2025-04-06   Max.   :0.9519   Max.   :0.9170  
                      NA's   :20       NA's   :21      

We can see that the mean RMSE from the daily models is lower (better) than that of the weekly models in this example.
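The IQR-based filtering described above can be sketched as follows. I assume the usual 1.5 × IQR fences here (the report does not state the exact multiplier), with filtered values masked as NA, consistent with the NA counts in the cleaned summary:

```python
import pandas as pd

def drop_iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Mask values outside [Q1 - k*IQR, Q3 + k*IQR] as NA."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.where(s.between(q1 - k * iqr, q3 + k * iqr))
```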

4.2 Secondary Metrics

Based on the primary metrics, the following secondary evaluation metrics were calculated for further analysis:

  1. Sharpe ratio (for target_neutral; higher is better)
  2. Max drawdown (for target_neutral; lower is better)
  3. Compound return, simulating reinvestment after each round (for target_neutral; higher is better)
  4. Trimmed mean RMSE (for target_updown; 10% of the values at each end were removed, which was needed to deal with outliers in the RMSE)
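For reference, here is one way to compute these secondary metrics in Python. These are common formulations; the report does not spell out its exact definitions (e.g. whether the Sharpe ratio is annualised or how returns are compounded), so treat them as illustrative:

```python
import numpy as np

def sharpe_ratio(scores: np.ndarray) -> float:
    """Mean over standard deviation of the per-round scores."""
    return float(scores.mean() / scores.std(ddof=1))

def max_drawdown(scores: np.ndarray) -> float:
    """Largest peak-to-trough drop of the compounded score curve."""
    curve = np.cumprod(1 + scores)
    peaks = np.maximum.accumulate(curve)
    return float(np.max(1 - curve / peaks))

def compound_return(scores: np.ndarray) -> float:
    """Total growth when gains are reinvested after each round."""
    return float(np.prod(1 + scores) - 1)

def trimmed_mean(values: np.ndarray, trim: float = 0.10) -> float:
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    lo, hi = np.quantile(values, [trim, 1 - trim])
    return float(values[(values >= lo) & (values <= hi)].mean())
```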

5 Comparison (Neutral)

5.1 Mean Spearman Correlation

5.1.1 Hypothesis / Expectations

Models trained with daily data should have higher mean Spearman correlation when compared to those trained with weekly data.

5.1.2 Observations (Stats)

  1. No. of daily models with higher mean correlation = 1008 out of 1008 (100%)

  2. Range of weekly models’ mean correlation (cor_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.1283  0.1370  0.1387  0.1382  0.1399  0.1423 

  3. Range of daily models’ mean correlation (cor_daily):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.1340  0.1433  0.1449  0.1443  0.1459  0.1479 

  4. Range of raw performance differences (cor_daily - cor_wkly):

         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     0.002133 0.005222 0.006134 0.006158 0.007129 0.012689 

  5. Range of percentage differences (%) (diff / cor_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       1.591   3.731   4.400   4.464   5.180   9.893 

5.1.3 Observations (Charts)


5.1.4 Result Table

Notes:

  1. depth = max_depth
  2. rsamp = subsample
  3. csamp = colsample_bytree
  4. round = round
  5. cor_wkly = mean correlation of weekly models’ predictions
  6. cor_daily = mean correlation of daily models’ predictions
  7. diff = cor_daily - cor_wkly (i.e. positive differences mean the daily models are better)
  8. p_diff = diff / cor_wkly * 100, the percentage difference (%)


5.2 Sharpe Ratio

5.2.1 Hypothesis / Expectations

Models trained with daily data should have higher Sharpe ratio (based on date-wise Spearman correlation) when compared to those trained with weekly data.

5.2.2 Observations (Stats)

  1. No. of daily models with higher Sharpe ratio = 963 out of 1008 (95.54%)

  2. Range of weekly models’ Sharpe ratio (sharpe_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       1.608   1.756   1.817   1.804   1.856   1.963 

  3. Range of daily models’ Sharpe ratio (sharpe_daily):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       1.628   1.810   1.879   1.861   1.924   1.981 

  4. Range of raw performance differences (sharpe_daily - sharpe_wkly):

         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     -0.05193  0.03792  0.05631  0.05696  0.07685  0.16861 

  5. Range of percentage differences (%) (diff / sharpe_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      -2.647   2.128   3.121   3.160   4.250   9.403 

5.2.3 Observations (Charts)


5.2.4 Result Table

Notes:

  1. depth = max_depth
  2. rsamp = subsample
  3. csamp = colsample_bytree
  4. round = round
  5. sharpe_wkly = Sharpe ratio of weekly models’ predictions
  6. sharpe_daily = Sharpe ratio of daily models’ predictions
  7. diff = sharpe_daily - sharpe_wkly (i.e. positive differences mean the daily models are better)
  8. p_diff = diff / sharpe_wkly * 100, the percentage difference (%)


5.3 Max Drawdown

5.3.1 Hypothesis / Expectations

I would have expected a higher (worse) max drawdown from the daily models compared to the weekly models, due to the increased noise in daily data. Yet the results proved me wrong: the daily models are actually better, with a lower max drawdown. Please see below.

5.3.2 Observations (Stats)

  1. No. of daily models with lower max drawdown = 876 out of 1008 (86.9%)

  2. Range of weekly models’ max drawdown (mdd_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     0.01610 0.04690 0.05440 0.05453 0.06220 0.08410 

  3. Range of daily models’ max drawdown (mdd_daily):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     0.00510 0.02718 0.03310 0.03475 0.03970 0.06360 

  4. Range of raw performance differences (mdd_daily - mdd_wkly):

         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     -0.06290 -0.03022 -0.02260 -0.01978 -0.01125  0.02480 

  5. Range of percentage differences (%) (diff / mdd_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      -84.52  -52.12  -42.03  -33.20  -22.24  112.56 

5.3.3 Observations (Charts)


5.3.4 Result Table

Notes:

  1. depth = max_depth
  2. rsamp = subsample
  3. csamp = colsample_bytree
  4. round = round
  5. mdd_wkly = max drawdown of weekly models’ predictions
  6. mdd_daily = max drawdown of daily models’ predictions
  7. diff = mdd_daily - mdd_wkly (i.e. negative differences mean the daily models are better)
  8. p_diff = diff / mdd_wkly * 100, the percentage difference (%)


5.4 Compound Return

5.4.1 Hypothesis / Expectations

Similar to Mean Spearman Correlation (as shown above), I expect models trained with daily data to show higher compound return when compared to those trained with weekly data.

5.4.2 Observations (Stats)

  1. No. of daily models with higher compound return = 1008 out of 1008 (100%)

  2. Range of weekly models’ compound return (return_comp_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       354.4   403.3   413.3   410.4   420.7   435.9 

  3. Range of daily models’ compound return (return_comp_daily):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       385.6   442.0   452.3   448.6   458.5   471.6 

  4. Range of raw performance differences (return_comp_daily - return_comp_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       12.26   32.69   38.55   38.21   44.30   72.90 

  5. Range of percentage differences (%) (diff / return_comp_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
       3.177   7.794   9.226   9.337  10.862  20.571 

5.4.3 Observations (Charts)


5.4.4 Result Table

Notes:

  1. depth = max_depth
  2. rsamp = subsample
  3. csamp = colsample_bytree
  4. round = round
  5. return_comp_wkly = compound return of weekly models’ predictions
  6. return_comp_daily = compound return of daily models’ predictions
  7. diff = return_comp_daily - return_comp_wkly (i.e. positive differences mean the daily models are better)
  8. p_diff = diff / return_comp_wkly * 100, the percentage difference (%)


6 Comparison (Updown)

6.1 Trimmed Mean RMSE

6.1.1 Hypothesis / Expectations

Models trained with daily data should have lower RMSE when compared to those trained with weekly data. Since a few outliers are expected in the mean values, the trimmed mean (i.e. with 10% of the values at each end removed) is used for this analysis.

6.1.2 Observations (Stats)

  1. No. of daily models with lower trimmed mean RMSE = 1008 out of 1008 (100%)

  2. Range of weekly models’ trimmed mean RMSE (rmse_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.4583  0.5473  0.6009  0.5956  0.6421  0.7439 

  3. Range of daily models’ trimmed mean RMSE (rmse_daily):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0.4218  0.4712  0.5039  0.5050  0.5342  0.6428 

  4. Range of raw performance differences (rmse_daily - rmse_wkly):

         Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
     -0.14027 -0.10743 -0.09196 -0.09059 -0.07554 -0.03166 

  5. Range of percentage differences (%) (diff / rmse_wkly):

        Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
     -20.836 -16.974 -15.217 -15.030 -13.471  -6.203 

6.1.3 Observations (Charts)


6.1.4 Result Table

Here is the full table of the trimmed mean RMSE comparison.

Notes:

  1. depth = max_depth
  2. rsamp = subsample
  3. csamp = colsample_bytree
  4. round = round
  5. rmse_wkly = trimmed mean RMSE of weekly models’ predictions
  6. rmse_daily = trimmed mean RMSE of daily models’ predictions
  7. diff = rmse_daily - rmse_wkly (i.e. negative differences mean the daily models are better)
  8. p_diff = diff / rmse_wkly * 100, the percentage difference (%)


7 Conclusions

  • A large grid search (1,008 combinations of different xgboost parameters) was used for this experiment.
  • Pairs of weekly and daily models (trained using the same parameters) were used to produce out-of-bag predictions on the same weekly test data (i.e. from 2023-01-01).
  • Early analysis of the performance comparison suggests that daily data improves out-of-bag predictive performance.

7.1 Summary (Target Neutral)

7.2 Summary (Target Updown)